Fix NHN Cloud title extraction #18

Merged
SmileJune merged 1 commit into main from ai/nhn-title-extraction
Jun 1, 2026

Conversation

@SmileJune SmileJune commented Jun 1, 2026

Owner

Approved scope

Corrects title extraction so that URL-like values are not stored as the article title in data collected from NHN Cloud Meetup.

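Changes

  • Normalizes postPerLang.title from the NHN Cloud post API and uses it as the title.
  • Handles the NHN Cloud Meetup suffix so that it is not duplicated.
  • Rejects title candidates that are http/https URLs or that have a domain/path form.
  • In general sitemap HTML collection as well, skips the article when only URL-like title candidates are available.
  • Adds regression tests for the sitemap crawler.
  • Records the fix and the verification results in the development log.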
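
The rejection and suffix rules listed above could look roughly like the following. This is a minimal sketch, not the code merged in this PR: the function and constant names are taken from the review walkthrough, but their bodies, the regular expressions, and the suffix value " : NHN Cloud Meetup" are assumptions made for illustration.

```python
import re
from typing import Optional

# Assumed suffix text; the PR names the constant but does not show its value.
NHN_CLOUD_MEETUP_TITLE_SUFFIX = " : NHN Cloud Meetup"

# Candidate starts with an http/https scheme.
_SCHEME_RE = re.compile(r"^https?://", re.IGNORECASE)
# Candidate is one whitespace-free token shaped like "domain.tld" or "domain.tld/path".
_DOMAIN_PATH_RE = re.compile(
    r"^[a-z0-9-]+(?:\.[a-z0-9-]+)*\.[a-z]{2,}(?:/\S*)?$", re.IGNORECASE
)


def is_url_like_title(title: str) -> bool:
    """Return True when the candidate looks like a URL rather than a title."""
    text = title.strip()
    return bool(_SCHEME_RE.match(text) or _DOMAIN_PATH_RE.match(text))


def clean_article_title(raw: Optional[str]) -> Optional[str]:
    """Collapse whitespace and reject empty or URL-like candidates."""
    if raw is None:
        return None
    text = " ".join(raw.split())
    if not text or is_url_like_title(text):
        return None
    return text


def normalize_nhn_cloud_title(raw: Optional[str]) -> Optional[str]:
    """Clean the API title and make sure the suffix appears exactly once."""
    title = clean_article_title(raw)
    if title is None:
        return None
    # Strip every trailing copy of the suffix before appending a single one.
    while title.endswith(NHN_CLOUD_MEETUP_TITLE_SUFFIX):
        title = title[: -len(NHN_CLOUD_MEETUP_TITLE_SUFFIX)].rstrip()
    if not title:
        return None
    return title + NHN_CLOUD_MEETUP_TITLE_SUFFIX
```

One trade-off of a pattern like this is that a title consisting of a single dotted token (for example a bare "Node.js") would also be rejected, so the real helper may well use a narrower rule.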

Intentionally excluded

  • No database migration.
  • No direct modification of the production DB. The local and production DBs currently contain 0 NHN Cloud titles in URL form.
  • No changes to search ranking or Elasticsearch queries.
  • No auto-merge.

Verification

  • uv run pytest tests/test_sitemap_crawler.py -> 4 passed
  • uv run ruff check app/crawler/sitemap.py tests/conftest.py tests/test_sitemap_crawler.py -> All checks passed
  • Confirmed that the count of URL-like titles for the nhn-cloud-meetup source is 0 in both the local DB and the production DB.

How a human can verify

  1. Check out the PR branch.
  2. Change into the backend directory with cd apps/backend.
  3. Run the pytest and ruff commands above.
  4. Check the test in which a URL-like NHN Cloud title is handled with SkippedArticleError.

Summary by CodeRabbit

  • Bug Fixes
    • Improved article title validation to prevent URL-like strings from being stored as titles.
    • Articles without valid titles are now properly skipped rather than using fallback URLs.
    • NHN Cloud Meetup articles receive properly normalized titles with consistent formatting.

coderabbitai Bot commented Jun 1, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: d69ea709-4cf4-4a74-8d2b-59d500b07c84

📥 Commits

Reviewing files that changed from the base of the PR and between 3a4bf33 and 46c1417.

📒 Files selected for processing (4)
  • apps/backend/app/crawler/sitemap.py
  • apps/backend/tests/conftest.py
  • apps/backend/tests/test_sitemap_crawler.py
  • docs/development-log.md

Walkthrough

The PR adds title validation and cleaning logic to the sitemap crawler to prevent URL-like strings from being stored as article titles. New helper functions detect and filter URL-like titles, with NHN Cloud-specific normalization. Both NHN Cloud API extraction and general HTML parsing now raise SkippedArticleError when all title candidates are invalid, rather than falling back to the page URL.

Changes

Article Title Validation and Cleaning for Sitemap Crawler

| Layer / File(s) | Summary |
| --- | --- |
| Title validation and normalization helpers (apps/backend/app/crawler/sitemap.py) | Adds constant NHN_CLOUD_MEETUP_TITLE_SUFFIX, introduces is_url_like_title(), clean_article_title(), and normalize_nhn_cloud_title() utilities. Updates first_heading_title() to use clean_article_title() when extracting <h1> text. |
| NHN Cloud API extraction with title validation (apps/backend/app/crawler/sitemap.py, apps/backend/tests/test_sitemap_crawler.py) | extract_nhn_cloud_payload() now normalizes API-provided titles via normalize_nhn_cloud_title() and raises SkippedArticleError for missing/URL-like titles. Tests verify correct title extraction and rejection of URL-like candidates. |
| General article extraction with title validation (apps/backend/app/crawler/sitemap.py, apps/backend/tests/test_sitemap_crawler.py) | extract_article_payload() tries multiple cleaned title sources (meta tags, document short title, HTML title, first heading) and raises SkippedArticleError when all are invalid. Tests confirm the function skips articles when all title candidates are URL-like. |
| Title cleaning validation unit tests (apps/backend/tests/test_sitemap_crawler.py) | Direct tests for clean_article_title() confirm it returns None for URL-like inputs and preserves normal text titles. |
| Test infrastructure and mock helpers (apps/backend/tests/conftest.py, apps/backend/tests/test_sitemap_crawler.py) | Adds Python path setup in conftest to enable backend module imports. Defines nhn_source() helper and nhn_api_client() mock using httpx.MockTransport to simulate NHN Cloud API responses. |
| Development log documentation (docs/development-log.md) | Documents the title validation changes, test commands, and verification results in the development log. |
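
The "try each cleaned source, then skip" behavior described for extract_article_payload() can be illustrated as below. This is a hedged sketch under stated assumptions: pick_title() is a hypothetical helper that does not appear in the PR, SkippedArticleError is redefined here as a stand-in for the crawler's own exception, and the URL-like pattern is a guess at what the real check might cover.

```python
import re
from typing import Iterable, Optional


class SkippedArticleError(Exception):
    """Stand-in for the crawler's exception of the same name."""


# Assumed pattern: an http/https URL, or a single token shaped like domain.tld[/path].
_URL_LIKE_RE = re.compile(
    r"^(?:https?://\S+|[a-z0-9-]+(?:\.[a-z0-9-]+)*\.[a-z]{2,}(?:/\S*)?)$",
    re.IGNORECASE,
)


def clean_article_title(raw: Optional[str]) -> Optional[str]:
    """Collapse whitespace; return None for empty or URL-like candidates."""
    if raw is None:
        return None
    text = " ".join(raw.split())
    if not text or _URL_LIKE_RE.match(text):
        return None
    return text


def pick_title(candidates: Iterable[Optional[str]], page_url: str) -> str:
    """Hypothetical helper: return the first usable candidate, in priority order.

    Raises SkippedArticleError instead of falling back to page_url, so a URL
    is never stored as an article title.
    """
    for candidate in candidates:
        title = clean_article_title(candidate)
        if title is not None:
            return title
    raise SkippedArticleError(f"no usable title for {page_url}")
```

In this sketch the caller would pass the candidates in the order named in the walkthrough (meta tags, document short title, HTML title, first heading) and treat the exception as a signal to skip the article.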

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 A crawler's tale, with titles clean,
No URLs where titles should gleam,
NHN suffixes dance with grace,
While URL-like strings find no place,
Tests ensure each heading's true,
And articles skip if none will do!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The PR title clearly summarizes the main change: correcting NHN Cloud title extraction logic to prevent URL-like values from being saved as article titles. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@SmileJune SmileJune marked this pull request as ready for review June 1, 2026 05:31
@SmileJune SmileJune merged commit cf54345 into main Jun 1, 2026
2 checks passed